We want to detect handwritten text in scanned or PDF documents; there can be a number of reasons for doing so.
Take the following document image as an example. We wish to detect the text highlighted by the red bounding boxes.
We will use the annotated dataset available in the following GitHub repo: https://github.com/CatalystCode/Handwriting/tree/master/Data/labelledcontracttrainingdata/trainingjpg_output_99/
The dataset accompanies the Microsoft blog post available here: https://devblogs.microsoft.com/cse/2018/05/07/handwriting-detection-and-recognition-in-scanned-documents-using-azure-ml-package-computer-vision-azure-cognitive-services-ocr/
Each image is annotated in Pascal VOC annotation format using the Microsoft VoTT annotation tool. The directory structure of the annotated dataset looks like this:
data/Annotations_99
data/JPEGImages_99
There are 99 annotated images in the dataset. The images are in the JPEGImages_99 folder and the corresponding XML annotations are under Annotations_99.
An XML annotation file looks like this:
<annotation verified="yes">
  <folder>Annotation</folder>
  <filename>07653e58-24d1-4b3f-9b4a-76057efe5c09-1</filename>
  <path>C:\data\JPEGImages\07653e58-24d1-4b3f-9b4a-76057efe5c09-1.jpg</path>
  <source>
    <database>Unknown</database>
  </source>
  <size>
    <width>1700</width>
    <height>2200</height>
    <depth>3</depth>
  </size>
  <segmented>0</segmented>
  <object>
    <name>signature</name>
    <pose>Unspecified</pose>
    <bndbox>
      <xmin>192</xmin>
      <ymin>1188</ymin>
      <xmax>738</xmax>
      <ymax>1320</ymax>
    </bndbox>
  </object>
  ...
In Pascal VOC annotation, there is a separate annotation file for each image. The data we are interested in from each XML file is the image size and, for every object, its class name and bounding box coordinates.
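As a small illustration of reading those fields, here is a minimal sketch that parses an abbreviated copy of the annotation above with Python's standard library. The helper name read_voc_annotation is made up for this example, and the XML is shortened and closed by hand so that it parses on its own.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the annotation shown above, closed so that it parses.
SAMPLE_XML = """
<annotation verified="yes">
  <filename>07653e58-24d1-4b3f-9b4a-76057efe5c09-1</filename>
  <size><width>1700</width><height>2200</height><depth>3</depth></size>
  <object>
    <name>signature</name>
    <bndbox><xmin>192</xmin><ymin>1188</ymin><xmax>738</xmax><ymax>1320</ymax></bndbox>
  </object>
</annotation>
"""


def read_voc_annotation(xml_text):
    """Return the image size and a list of (label, [xmin, ymin, xmax, ymax])."""
    root = ET.fromstring(xml_text)
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    objects = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        box = obj.find("bndbox")
        coords = [int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")]
        objects.append((label, coords))
    return {"width": width, "height": height, "objects": objects}


info = read_voc_annotation(SAMPLE_XML)
print(info)
```

The full loader used later in this post does the same thing per file and additionally maps the class name to a category id.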
We will use the PyTorch-based Detectron2 framework because it is simple and easy to extend. Detectron2 provides training, visualization, and prediction modules that handle most of the work; we can use them as is or extend them if required.
Simple steps to train a vision model in Detectron2
Detectron2 expects the dataset as a list[dict] in the following format, so for training we have to convert our dataset into it.
[{'file_name': 'datasets/JPEGImages/1.jpg',
  'image_id': '1',
  'height': 3300,
  'width': 2550,
  'annotations': [{'category_id': 1,
                   'bbox': [1050.1000264270613,
                            457.33333333333337,
                            1406.9139799154334,
                            587.7450980392157],
                   'bbox_mode': <BoxMode.XYXY_ABS: 0>},
                  {'category_id': 1,
                   'bbox': [1529.9097515856238,
                            473.5098039215687,
                            1617.167679704017,
                            555.3921568627452],
                   'bbox_mode': <BoxMode.XYXY_ABS: 0>}]}]
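Before handing records in this shape to Detectron2, it can help to sanity-check them. The sketch below is an assumption-laden helper, not part of Detectron2: check_record and REQUIRED_KEYS are names invented here, and it assumes boxes are absolute [xmin, ymin, xmax, ymax] values as in the example above.

```python
REQUIRED_KEYS = ("file_name", "image_id", "height", "width", "annotations")


def check_record(record):
    """Return a list of problems found in one dataset record (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in record]
    if problems:
        return problems
    for i, ann in enumerate(record["annotations"]):
        x0, y0, x1, y1 = ann["bbox"]  # assumes absolute XYXY boxes
        inside = 0 <= x0 < x1 <= record["width"] and 0 <= y0 < y1 <= record["height"]
        if not inside:
            problems.append(f"annotation {i}: box {ann['bbox']} is degenerate or outside the image")
    return problems


record = {
    "file_name": "datasets/JPEGImages/1.jpg",
    "image_id": "1",
    "height": 3300,
    "width": 2550,
    "annotations": [
        {"category_id": 1, "bbox": [1050.1, 457.3, 1406.9, 587.7]},
        {"category_id": 1, "bbox": [1529.9, 473.5, 2600.0, 555.4]},  # deliberately too wide
    ],
}
problems = check_record(record)
print(problems)
```

Running a check like this over every record catches swapped coordinates or boxes that spill outside the page before training starts.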
Detectron2 registers this list of dicts as a torch dataset and uses the default data loader and sampler for training. We can register the list[dict] with Detectron2 using the following code:
from detectron2.data import DatasetCatalog

def get_dicts():
    ...  # build and return a list[dict] in the format shown above

DatasetCatalog.register("my_dataset", get_dicts)
To register metadata about the dataset, such as the mapping from category names to ids or the type of dataset, we set key-value pairs using
from detectron2.data import MetadataCatalog
MetadataCatalog.get("my_dataset").thing_classes = ["person", "dog"]
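One way to picture why the catalog is given a function rather than the data itself: it can store the callable and only run it when the dataset is requested. The class below is a toy stand-in written for illustration; ToyCatalog is a hypothetical name and this is not Detectron2's implementation.

```python
class ToyCatalog:
    """Maps a dataset name to a zero-argument function that builds the records."""

    def __init__(self):
        self._loaders = {}

    def register(self, name, loader):
        if name in self._loaders:
            raise KeyError(f"dataset '{name}' is already registered")
        self._loaders[name] = loader

    def get(self, name):
        # The loader runs here, not at registration time.
        return self._loaders[name]()


calls = []


def get_dicts():
    calls.append("loaded")
    return [{"file_name": "datasets/JPEGImages/1.jpg", "image_id": "1"}]


catalog = ToyCatalog()
catalog.register("my_dataset", get_dicts)
assert calls == []  # nothing has been loaded yet
records = catalog.get("my_dataset")
print(len(records), calls)
```

Deferring the load this way means registering many datasets stays cheap, since annotation files are only read for the ones that are actually used.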
Detectron2 has a lot of pretrained models available in its model zoo. For handwritten text detection, we will choose Faster R-CNN with an FPN backbone.
We have to initialize the parameters and weights for the model we want to train.
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file('<pretrained model config>')
cfg.MODEL.WEIGHTS = '<path to pretrained model weight>'
# custom config for training
cfg.DATASETS.TRAIN = ("<registered training dataset name>",)
cfg.SOLVER.MAX_ITER = '<number of training iterations>'  # placeholder: replace with an integer
cfg.MODEL.ROI_HEADS.NUM_CLASSES = '<number of classes>'  # placeholder: replace with an integer
All the model configuration is available in the cfg object. If we want to replicate the training later, we can save the cfg object and load it back to resume training.
We will use the DefaultTrainer for now. It is a simple module that accepts only minimal parameters and makes assumptions about a lot of things.
The DefaultTrainer Module
from detectron2.engine import DefaultTrainer

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
Now we can train our detection model using Detectron2. We will try the Faster R-CNN R50-FPN model and see how it performs.
To visualize the labeled dataset in Detectron2, we need to convert the XML annotations into the Detectron2 dataset format explained above.
We will use the custom function register_pascal_voc(), which converts the dataset into Detectron2 format and registers it with DatasetCatalog. It expects a directory structure like
Annotations
JPEGImages
train.txt
train.txt and test.txt contain one file name (without extension) per line.
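If the split files do not exist yet, they can be generated from the annotation folder. The following is a minimal sketch under stated assumptions: write_split_files, the 20% test fraction, and the fixed seed are choices made for this example, not something prescribed by the dataset or by Detectron2.

```python
import os
import random
import tempfile


def write_split_files(dirname, test_fraction=0.2, seed=0):
    """Write train.txt and test.txt listing annotation file ids, one per line."""
    ann_dir = os.path.join(dirname, "Annotations")
    fileids = sorted(
        os.path.splitext(name)[0] for name in os.listdir(ann_dir) if name.endswith(".xml")
    )
    random.Random(seed).shuffle(fileids)  # seeded so the split is reproducible
    n_test = int(len(fileids) * test_fraction)
    splits = {"test": fileids[:n_test], "train": fileids[n_test:]}
    for split, ids in splits.items():
        with open(os.path.join(dirname, split + ".txt"), "w") as f:
            f.writelines(fileid + "\n" for fileid in ids)
    return splits


# Demo on a throwaway directory with ten empty annotation files.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "Annotations"))
    for i in range(10):
        open(os.path.join(root, "Annotations", f"doc{i}.xml"), "w").close()
    splits = write_split_files(root)
    with open(os.path.join(root, "train.txt")) as f:
        train_lines = f.read().split()

print(len(splits["train"]), len(splits["test"]))
```

Keeping the test ids out of train.txt matters later, when the evaluation numbers are only meaningful on images the model has not seen.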
To draw the annotations on the images, we will use the Detectron2 Visualizer class, which takes the image in RGB format, the metadata holding the ordered label names, and a scale parameter.
Visualizer.draw_instance_predictions() function to visualize prediction results
Visualizer.draw_dataset_dict() function to draw the annotated dataset
%matplotlib inline
import numpy as np
import os
import xml.etree.ElementTree as ET
from detectron2.data import DatasetCatalog, MetadataCatalog
from detectron2.structures import BoxMode
from fvcore.common.file_io import PathManager
import random
import cv2
from detectron2.utils.visualizer import Visualizer
from matplotlib.pyplot import figure
from matplotlib import pyplot as plt
def load_voc_instances(dirname, split, CLASS_NAMES):
    """
    Load Pascal VOC detection annotations to Detectron2 format.
    Args:
        dirname: Contain "Annotations", "JPEGImages"
        split (str): one of "train", "test", "val", "trainval"
        CLASS_NAMES (list[str]): ordered class names; the index of a name is its category id
    """
    with PathManager.open(os.path.join(dirname, split + ".txt")) as f:
        # np.str was removed in NumPy 1.24; the builtin str behaves the same here.
        fileids = np.loadtxt(f, dtype=str)
    dicts = []
    for fileid in fileids:
        anno_file = os.path.join(dirname, "Annotations", fileid + ".xml")
        jpeg_file = os.path.join(dirname, "JPEGImages", fileid + ".jpg")
        tree = ET.parse(anno_file)
        r = {
            "file_name": jpeg_file,
            "image_id": fileid,
            "height": int(tree.findall("./size/height")[0].text),
            "width": int(tree.findall("./size/width")[0].text),
        }
        instances = []
        for obj in tree.findall("object"):
            cls = obj.find("name").text
            # We include "difficult" samples in training.
            # Based on limited experiments, they don't hurt accuracy.
            # difficult = int(obj.find("difficult").text)
            # if difficult == 1:
            #     continue
            bbox = obj.find("bndbox")
            bbox = [float(bbox.find(x).text) for x in ["xmin", "ymin", "xmax", "ymax"]]
            # Original annotations are integers in the range [1, W or H]
            # Assuming they mean 1-based pixel indices (inclusive),
            # a box with annotation (xmin=1, xmax=W) covers the whole image.
            # In coordinate space this is represented by (xmin=0, xmax=W)
            bbox[0] -= 1.0
            bbox[1] -= 1.0
            instances.append(
                {"category_id": CLASS_NAMES.index(cls), "bbox": bbox, "bbox_mode": BoxMode.XYXY_ABS}
            )
        r["annotations"] = instances
        dicts.append(r)
    return dicts
def visualize_dataset(datasetname, n_samples=10):
    dataset_dicts = DatasetCatalog.get(datasetname)
    metadata = MetadataCatalog.get(datasetname)
    for d in random.sample(dataset_dicts, n_samples):
        print(d['file_name'])
        img = cv2.imread(d["file_name"])
        # OpenCV loads images as BGR; the Visualizer expects RGB.
        visualizer = Visualizer(img[:, :, ::-1],
                                metadata=metadata, scale=0.5)
        vis = visualizer.draw_dataset_dict(d)
        figure(num=None, figsize=(15, 15), dpi=100, facecolor='w', edgecolor='k')
        plt.axis("off")
        # get_image() is already RGB, which is what matplotlib expects.
        plt.imshow(vis.get_image())
        plt.show()
def register_pascal_voc(name, dirname, split, CLASS_NAMES):
    if name not in DatasetCatalog.list():
        DatasetCatalog.register(name, lambda: load_voc_instances(dirname, split, CLASS_NAMES))
    MetadataCatalog.get(name).set(
        thing_classes=CLASS_NAMES, split=split, dirname=dirname, year=2012
    )
#register pascal voc dataset in detectron2
register_pascal_voc('signature_dataset_train', dirname='datasets', split='train', CLASS_NAMES=["signature","others"])
visualize_dataset('signature_dataset_train',n_samples = 4)
from detectron2.engine import default_argument_parser
from detectron2.engine import DefaultTrainer
from detectron2.engine import default_setup
from detectron2.config import get_cfg
def setup_cfg(args):
    """
    Create configs and perform basic setups.
    """
    cfg = get_cfg()
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    cfg.freeze()
    default_setup(cfg, args)
    return cfg
parser = default_argument_parser()
args = parser.parse_args("--config-file sign_config/sign_faster_rcnn_R_50_FPN_3x.yaml OUTPUT_DIR sign_model ".split())
We copied the config file for Faster R-CNN R50-FPN from the model zoo as sign_faster_rcnn_R_50_FPN_3x.yaml and updated its parameters: the number of classes under MODEL.ROI_HEADS is set to 2, the maximum number of iterations to 4000, and the training dataset name to the one we registered earlier.
The setup_cfg function loads the configuration from the --config-file path and then updates it with the other parameters passed as arguments.
Here, we have passed OUTPUT_DIR to override the value of cfg.OUTPUT_DIR.
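The trailing arguments are a flat list of alternating keys and values, where a dotted key addresses a nested section of the config. The sketch below mimics that idea on plain dictionaries; apply_overrides and the starting values in toy_cfg are made up for illustration, and it is not Detectron2's implementation (which also converts and checks value types).

```python
def apply_overrides(cfg, opts):
    """Apply ["A.B", value, "C", value, ...] to a nested dict in place."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must be KEY VALUE pairs")
    for key, value in zip(opts[0::2], opts[1::2]):
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            node = node[name]
        if leaf not in node:
            raise KeyError(f"unknown config key: {key}")
        node[leaf] = value  # values stay strings in this toy version
    return cfg


# Made-up starting values, only to show the override taking effect.
toy_cfg = {"OUTPUT_DIR": "./output", "SOLVER": {"MAX_ITER": 1000}}
apply_overrides(toy_cfg, "OUTPUT_DIR sign_model SOLVER.MAX_ITER 4000".split())
print(toy_cfg)
```

Rejecting unknown keys, as this sketch does, is a useful property in practice because a misspelled option fails loudly instead of being silently ignored.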
cfg = setup_cfg(args)
Now that we have all the configurations, we can start training the model.
As explained earlier, DefaultTrainer builds the model (without weights), the optimizer, and the learning rate scheduler, and then loads the weights from the checkpoint file specified in cfg.MODEL.WEIGHTS.
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
The model has now been trained and saved in the output directory. The config saved during training has all the parameters except the path to the trained weights, so we pass the model weight path as a parameter to load them.
The DefaultPredictor applies the image transformation (such as resizing) and takes only a single image per call. We can, however, easily modify the DefaultPredictor class to accept a batch of input images for prediction.
from detectron2.engine import default_argument_parser
from detectron2.engine import DefaultPredictor
import config
parser = default_argument_parser()
args = parser.parse_args("--config-file sign_model/config.yaml MODEL.WEIGHTS sign_model/model_final.pth".split())
cfg = config.setup_cfg(args)
predictor = DefaultPredictor(cfg)
import glob
import time
import os
from matplotlib.pyplot import figure
from matplotlib import pyplot as plt
import cv2
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog
files = glob.glob("test_images/*.jpg")
sample_size = 5
for file, _ in zip(files, range(sample_size)):
    im = cv2.imread(file)
    MetadataCatalog.get("signature_dataset_train").thing_classes = ["signature", "others"]
    start_time = time.time()
    outputs = predictor(im)
    print(time.time() - start_time)  # inference time in seconds
    # OpenCV loads images as BGR; the Visualizer expects RGB.
    v = Visualizer(im[:, :, ::-1], metadata=MetadataCatalog.get("signature_dataset_train"), scale=0.5)
    v = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    print(file)
    figure(num=None, figsize=(15, 15), dpi=100, facecolor='w', edgecolor='k')
    plt.axis("off")
    # get_image() is already RGB, which is what matplotlib expects.
    plt.imshow(v.get_image())
    plt.show()
The DefaultTrainer class does not implement an evaluator method. I have created a new Trainer class and added the build_evaluator method. We could have used this new Trainer class in the first step instead of DefaultTrainer, but I wanted to show how easy it is to train the model without writing more code.
from detectron2.engine import default_argument_parser
import config
import trainer
import dataset_utils
dataset_utils.register_pascal_voc('signature_dataset_train', dirname='datasets', split='train', CLASS_NAMES=["signature","others"])
#dataset_utils.register_pascal_voc('signature_dataset_test', dirname='datasets', split='test', CLASS_NAMES=["signature","others"])
parser = default_argument_parser()
args = parser.parse_args("--config-file sign_model/config.yaml MODEL.WEIGHTS sign_model/model_final.pth".split())
trainer.eval(args)
OrderedDict([('bbox', {'AP': 69.67298193268152, 'AP50': 98.10585383880746, 'AP75': 88.13255308840152})])
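In the result above, AP50 and AP75 are the average precision when a predicted box has to overlap a ground-truth box with an intersection over union (IoU) of at least 0.5 or 0.75 to count as correct. Here is a minimal sketch of the IoU computation for [xmin, ymin, xmax, ymax] boxes; box_iou is a hypothetical helper, not the evaluator's own code, and the predicted box is invented for the example.

```python
def box_iou(a, b):
    """IoU of two boxes given as [xmin, ymin, xmax, ymax]."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = [192, 1188, 738, 1320]  # the signature box from the XML example
prediction = [200, 1200, 760, 1330]    # invented prediction, slightly shifted
iou = box_iou(ground_truth, prediction)
print(round(iou, 3))
```

A prediction shifted by a few pixels like this one still clears both thresholds, which is why AP75 is the stricter of the two numbers.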
Let us say we have some more annotated data that is not in Pascal VOC XML format. To train the above model on it, we have to write a custom function get_dicts() that returns the data in Detectron2 format.
To improve the accuracy of handwriting detection, I found one more dataset, which consists of annotated French documents. The annotations are in JSON format, one file per image. The dataset is available in the following GitHub repo: https://github.com/hyperlex/Signature-detection-Practical-guide/tree/master/data/dataset. Download it and save it in the french_dataset directory.
I have added all the functions to the library files dataset_utils.py and trainer.py. We will use these abstractions to quickly train and evaluate new models.
import json
import glob
import os
import cv2
from detectron2.structures import BoxMode
def get_french_dicts(annot_dir):
    json_files = glob.glob(os.path.join(annot_dir, '*.json'))
    dataset_dicts = []
    for f in json_files:
        record = {}
        with open(f) as fp:
            img_ann = json.load(fp)
        filename = img_ann['asset']['name']
        # The images live one directory above the per-image label files.
        image_path = os.path.join(annot_dir, '..', filename)
        height, width = cv2.imread(image_path).shape[:2]
        record["file_name"] = image_path
        record["image_id"] = img_ann['asset']['id']
        record["height"] = height
        record["width"] = width
        annos = img_ann["regions"]
        objs = []
        for ann in annos:
            # Convert left/top/width/height to absolute corner coordinates.
            px = ann['boundingBox']['left']
            py = ann['boundingBox']['top']
            px1 = ann['boundingBox']['left'] + ann['boundingBox']['width']
            py1 = ann['boundingBox']['top'] + ann['boundingBox']['height']
            obj = {
                "bbox": [px, py, px1, py1],
                "bbox_mode": BoxMode.XYXY_ABS,
                # 'paraphe' and 'date' both map to the "others" class (id 1).
                "category_id": {'signature': 0, 'paraphe': 1, 'date': 1}[ann['tags'][0]],
                "iscrowd": 0
            }
            objs.append(obj)
        record["annotations"] = objs
        dataset_dicts.append(record)
    return dataset_dicts
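To make the box conversion concrete, here is a small worked example on a single region, without any image loading. The region dictionary is made up for illustration (its numbers are not taken from the dataset), and region_to_annotation is a hypothetical helper that mirrors the inner loop of the function above.

```python
TAG_TO_CATEGORY = {"signature": 0, "paraphe": 1, "date": 1}  # same mapping as above


def region_to_annotation(region):
    """Convert one region (left/top/width/height) to an absolute XYXY annotation."""
    box = region["boundingBox"]
    x0, y0 = box["left"], box["top"]
    return {
        "bbox": [x0, y0, x0 + box["width"], y0 + box["height"]],
        "category_id": TAG_TO_CATEGORY[region["tags"][0]],
    }


# Made-up region for illustration only.
region = {
    "tags": ["paraphe"],
    "boundingBox": {"left": 100.0, "top": 50.0, "width": 40.0, "height": 20.0},
}
annotation = region_to_annotation(region)
print(annotation)
```

Detectron2 also defines BoxMode.XYWH_ABS, so an alternative would be to pass [left, top, width, height] unchanged with that mode; converting to XYXY here keeps both datasets in the same representation.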
from detectron2.utils.visualizer import Visualizer
from detectron2.data import DatasetCatalog, MetadataCatalog
import dataset_utils
def get_img_dicts():
    ann1 = dataset_utils.load_voc_instances(dirname='datasets', split='train', CLASS_NAMES=["signature", "others"])
    ann2 = get_french_dicts('french_dataset/per_img_labels')
    return ann1 + ann2

dataset_name = 'signature_dataset_train'
DatasetCatalog.register(dataset_name, lambda: get_img_dicts())
# dirname points at the VOC-style directory used in get_img_dicts().
MetadataCatalog.get(dataset_name).set(thing_classes=["signature", "others"], split='train', dirname='datasets', year=2012)
len(DatasetCatalog.get(dataset_name))
dataset_utils.visualize_dataset('signature_dataset_train', n_samples=2)
We can run the remaining steps as before to train the model and run prediction.
from detectron2.engine import default_argument_parser
from detectron2.engine import DefaultTrainer
import trainer
parser = default_argument_parser()
args = parser.parse_args("--config-file sign_config/chk_faster_rcnn_R_50_FPN_3x.yaml --num-gpus 3 OUTPUT_DIR french_sign_model SOLVER.MAX_ITER 4000".split())
trainer.train(args)
from detectron2.engine import default_argument_parser
import config
import trainer
import dataset_utils
dataset_utils.register_pascal_voc('signature_dataset_test', dirname='datasets', split='train', CLASS_NAMES=["signature","others"])
parser = default_argument_parser()
args = parser.parse_args('--config-file french_sign_model/config.yaml MODEL.WEIGHTS french_sign_model/model_final.pth DATASETS.TEST ("signature_dataset_test",)'.split())
trainer.eval(args)
OrderedDict([('bbox', {'AP': 67.82590895115814, 'AP50': 96.77629667360196, 'AP75': 84.18387912626437})])
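To compare this result with the earlier one, the two metric dictionaries can be subtracted key by key. The values below are copied from the two evaluation outputs in this post; the variable names are chosen for this example, and the comparison assumes both runs were evaluated on the same images.

```python
first_model = {"AP": 69.67298193268152, "AP50": 98.10585383880746, "AP75": 88.13255308840152}
second_model = {"AP": 67.82590895115814, "AP50": 96.77629667360196, "AP75": 84.18387912626437}

# Negative numbers mean the second model scored lower on that metric.
delta = {name: round(second_model[name] - first_model[name], 2) for name in first_model}
print(delta)
```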
The Average Precision has decreased compared to the previous model. Let us check the prediction results.
from detectron2.engine import default_argument_parser
from detectron2.engine import DefaultPredictor
import config
parser = default_argument_parser()
args = parser.parse_args("--config-file french_sign_model/config.yaml MODEL.WEIGHTS french_sign_model/model_final.pth".split())
cfg = config.setup_cfg(args)
predictor = DefaultPredictor(cfg)
import glob
import time
import os
from matplotlib.pyplot import figure
from matplotlib import pyplot as plt
import cv2
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog
files = glob.glob("test_images/*.jpg")
sample_size = 5
for file, _ in zip(files, range(sample_size)):
    im = cv2.imread(file)
    MetadataCatalog.get("signature_dataset_train").thing_classes = ["signature", "others"]
    start_time = time.time()
    outputs = predictor(im)
    print(time.time() - start_time)  # inference time in seconds
    # OpenCV loads images as BGR; the Visualizer expects RGB.
    v = Visualizer(im[:, :, ::-1], metadata=MetadataCatalog.get("signature_dataset_train"), scale=0.5)
    v = v.draw_instance_predictions(outputs["instances"].to("cpu"))
    print(file)
    figure(num=None, figsize=(15, 15), dpi=100, facecolor='w', edgecolor='k')
    # get_image() is already RGB, which is what matplotlib expects.
    plt.imshow(v.get_image())
    plt.show()